Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NFA macro + better NFA matcher #58

Merged
merged 37 commits into from
Apr 4, 2020
Merged

NFA macro + better NFA matcher #58

merged 37 commits into from
Apr 4, 2020

Conversation

nitely
Copy link
Owner

@nitely nitely commented Mar 16, 2020

Changes:

  • Remove epsilon-transitions
  • Refactor regex module into many modules
  • Simplify most matching algorithms
  • Adds benchmarks against nim's re (most LoC changes are because of the input text file)

Needs benchmarking to check for regressions.

closes #57
closes #28

@nitely nitely changed the title Revamp NFA macro + better NFA matcher Mar 24, 2020
@nitely
Copy link
Owner Author

nitely commented Mar 24, 2020

Quick benchmark:

============================================================================
GlobalBenchmark                                 relative  time/iter  iters/s
============================================================================
GlobalBenchmark                                            294.86ps    3.39G
============================================================================
somere.nim                                      relative  time/iter  iters/s
============================================================================
re_sol                                                       1.10ms   910.91
regex_sol                                         64.79%     1.69ms   590.20
re_nums                                                    171.08ns    5.85M
regex_nums2                                      88.46%   193.39ns    5.17M
re_nums                                                    176.39ns    5.67M
regex_nums                                       95.51%   184.69ns    5.41M

Nim's re VS nim-regex macro-NFA.

@nitely
Copy link
Owner Author

nitely commented Mar 27, 2020

Almost done. I need to implement the find version that takes O(N*M) time, and benchmark it against the new O(N^2) version. It should be faster when N > M assuming it does not add much overhead.

@nitely
Copy link
Owner Author

nitely commented Apr 4, 2020

benchmarks against nim-regex 0.13:

============================================================================
GlobalBenchmark                                 relative  time/iter  iters/s
============================================================================
GlobalBenchmark                                            294.87ps    3.39G
============================================================================
bench.nim                                       relative  time/iter  iters/s
============================================================================
regex_old_sol                                               42.60ms    23.48
regex_sol                                                    1.85ms   541.34
regex_old_nums                                               2.43us  410.78K
regex_nums                                                 206.64ns    4.84M
regex_old_nums2                                              2.92us  342.48K
regex_nums2                                                231.74ns    4.32M
regex_old_lits_find                                         41.71ms    23.97
regex_lits_find                                              1.03ms   968.61
email_old_find_all                                            1.27s  787.08m
email_find_all                                             472.56ms     2.12
dummy                                                        0.00fs      inf

@nitely nitely merged commit 0570009 into master Apr 4, 2020
@nitely nitely deleted the revamp branch April 4, 2020 00:33
@timotheecour
Copy link
Contributor

this sounds amazing!

a few points:

  • 1st benchmark shows:re_nums instead of re_nums2; not sure whether both algos are comparing the same
re_nums                                                    171.08ns    5.85M
regex_nums2                                      88.46%   193.39ns    5.17M
  • maybe benchmark could add comparison against nre as well
  • -d:danger should be used for benchmarking; perhaps also --passc:-flto;
  • more generally optimizations decisions should be made from -d:danger, not -d:release; otherwise you may end up drawing wrong conclusions and end up with suboptimal results in the end
  • maybe add nim 1.2 to CI now that it was released

Remove epsilon-transitions

  • haven't looked at the code change in detail but does that come with any loss of functionality?

@nitely
Copy link
Owner Author

nitely commented Apr 4, 2020

1st benchmark shows:re_nums instead of re_nums2; not sure whether both algos are comparing the same

They are. I've fixed the name.

maybe benchmark could add comparison against nre as well

IIRC, Araq said re is faster than nre, but I can add it, no problem.

-d:danger should be used for benchmarking; perhaps also --passc:-flto;

I did set -d:danger. -d:release is only 10% slower, IIRC.

maybe add nim 1.2 to CI now that it was released

Yeah, I'm using nim's docker. I'm sure it will be added in a few days.

haven't looked at the code change in detail but does that come with any loss of functionality?

No, it's just an optimization that avoids recursion in the matching algorithm.

@thomastay
Copy link

FWIW, I came across this PR when trying to benchmark mariomka for nim. Seems like most of the discussion has already been had in #57, but I wanted to just add in the benchmarks that I measured on my machine. Code is here in case I did something wrong; I am rather new to this repo.

Flags: -d:danger
Nim version: 1.3.1
nim-regex: 0.14.1

Benching with default GC, taking mean of 3 iterations
std/re: 28.0 - 92
std/re: 28.0 - 5301
std/re: 9.0 - 5
nim-regex: 371.0 - 92
nim-regex: 344.3333333333333 - 5301
nim-regex: 438.6666666666667 - 5

Also, I don't think a hybrid approach for nim-regex and PCRE would be a good idea, since the selling point of nim-regex is that it's a pure Nim implementation.

@nitely
Copy link
Owner Author

nitely commented Apr 26, 2020

find and findAll can be ~10x slower on some regexes due to lack of literal optimization (i.e: a quick memchr is done to find potential matches). I'm aware of it (see #59) and I'll optimize it as soon as I've some free time.

@timotheecour
Copy link
Contributor

timotheecour commented Apr 26, 2020

Also, I don't think a hybrid approach for nim-regex and PCRE would be a good idea, since the selling point of nim-regex is that it's a pure Nim implementation.

I'm not buying that argument.

At the end of the day, we should do what's practical and useful. Certainly nim-regex can have a pure nim implementation that keeps improving, and, all else being equal (eg performance), using a pure nim implementation is better, but nothing prevents it from benefiting from optimizations which could be controlled by some flags or library options.

This allows user code to stay identical (offering same interface) regardless of whether PCRE or other library is used (implementation detail).

Very concretely, the main point of pure nim implementation is making it available on all backends/environments (eg nim vm, maybe nim js or nimscript, or in cases where PCRE is not available); but, where performance matters and PCRE is available and higher performance than the nim engine, there is no practical reason not to benefit from it (other than implementing that bridge, which should be a lot easier than making improvements on the engine itself)

+1 for providing a benchmark though

@nitely
Copy link
Owner Author

nitely commented Jul 16, 2020

@thomastay @timotheecour good news, my regex literal optimization is 34x faster than re for the email regex, and about the same speed for the rest. It's not even the macro version, that one would be faster, but I don't think I'll implement it.

============================================================================
GlobalBenchmark                                 relative  time/iter  iters/s
============================================================================
GlobalBenchmark                                            294.86ps    3.39G
============================================================================
bench.nim                                       relative  time/iter  iters/s
============================================================================
re_email_find_all                                           24.19ms    41.33
email_find_all                                  3389.08%   713.91us    1.40K
re_uri_find_all                                             24.75ms    40.40
uri_find_all                                      92.92%    26.64ms    37.54
re_ip_find_all                                               6.41ms   156.06
ip_find_all                                      101.27%     6.33ms   158.04
runes                                                        5.99ms   167.04
dummy                                                        0.00fs      inf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants